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Abstract 

Nonlinear non-Gaussian state-space models arise in numerous applications in statistics and signal processing. In this 
context, one of the most successful and popular approximation techniques is the Sequential Monte Carlo (SMC) algorithm, 
also known as particle filtering. Nevertheless, this method tends to be inefficient when applied to high dimensional problems. 
In this paper, we focus on another class of sequential inference methods, namely the Sequential Markov Chain Monte Carlo 
(SMCMC) techniques, which represent a promising alternative to SMC methods. After providing a unifying framework for 
the class of SMCMC approaches, we propose novel efficient strategies based on the principle of Langevin diffusion and 
Hamiltonian dynamics in order to cope with the increasing number of high-dimensional applications. Simulation results 
show that the proposed algorithms achieve significantly better performance compared to existing algorithms. 

Index Terms 

Bayesian inference, filtering. Sequential Monte Carlo, Markov Chain Monte Carlo, state-space model, high-dimensional. 


I. Introduction 

In many applications, we are interested in estimating a signal from a sequence of noisy observations. Optimal filtering 
techniques for general non-linear and non-Gaussian state-space models are consequently of great interest. Except in a 
few special cases, including linear and Gaussian state space models (Kalman filter Q) and hidden finite-state space 
Markov chains, it is impossible to evaluate the filtering distribution analytically. However, linear systems with Gaussian 
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dynamics are generally inappropriate for the accurate modeling of a dynamical system, since they fail to account for the 
local non-linearities in the state space or the dynamic changing nature of the system which is under study. It is therefore 
increasingly common to consider non-linear or non-Gaussian dynamical systems. In the case of additive Gaussian errors, 
one could adopt an Extended Kalman filter (EKE) or in the case of non-Gaussian additive errors, an Unscented Kalman 
filter (UKF) lE). 

Since the nineties, sequential Monte Carlo (SMC) approaches have become a powerful methodology to cope with 
non-linear and non-Gaussian problems 0. In comparison with standard approximation methods, such as the EKE, the 
principal advantage of SMC methods is that they do not rely on any local linearization technique or any crude functional 
approximation. These particle filtering (PE) methods a, exploit numerical representation techniques for approximating 
the filtering probability density function of inherently nonlinear non-Gaussian systems. Using these methods for the 
empirical characterization of sequences of distributions allows one to obtain estimators formed based on these empirical 
samples which can be set arbitrarily close to the optimal solution at the expense of computational complexity. 

However, due to their importance sampling based design, classical SMC methods tend to be inefficient when applied 
to high-dimensional problems 0-0. This issue, known as the curse of dimensionality, has rendered traditional SMC 
algorithms largely useless in the increasing number of high-dimensional applications such as multiple target tracking, 
weather prediction, and oceanography. 

As discussed in 0, some strategies have been developed in order to improve traditional SMC methods for filtering 
in high-dimensional spaces. The first well-known technique, called Resample-Move 0, consists in applying a Markov 
Chain Monte Carlo (MCMC) kernel on each particle after the resampling stage in order to diversify the degenerate 
particle population thus improving the empirical approximation. However, such a technique can become computationally 
demanding in high-dimensional system since only a single unique particle tends to be duplicated by the resampling step. 
This method therefore requires many MCMC iterations to obtain satisfactory results. Secondly, we can cite the Block 
Sequential Importance Resampling (Block SIR) approach in which the underlying idea is to partition the state space 
into separate subspaces of small dimensions and run one SMC algorithm on each subspace na, 0 -ini. However, this 
strategy introduces in the final estimates a difficult to quantify bias which depends on the position along the split state 
vector elements. Another strategy, called space-time Particle filter (STPF), has been recently proposed in ifT^ . The key 
idea of the STPF is to exploit a specific factorization of the posterior distribution to design a particle filter moving along 
both the space and time index (as opposed to traditional particle filter that moves only along the time index). However, 
since the local particle filters are running along space dimension, a patch degeneracy (on the space dimension) effect 
can be expected as the dimension of the system increases 0, na. 

A promising alternative class of methods for Bayesian filtering, known as Sequential Markov Chain Monte Carlo 
(SMCMC), has been proposed in several papers uni-iini and successfully applied to challenging applications ESI, ED- 
The main idea of such approaches lies in their ability to transform MCMC to an online inference method. In this paper, 
we firstly provide a unifying framework regarding the different SMCMC methods that have been proposed so far in the 
literature. Then, the optimal and alternative possible choices of MCMC kernel that can be used in practice are discussed. 
More importantly, we propose novel efficient strategies based on either Langevin diffusion or Hamiltonian dynamics in 
order to improve the efficiency of this class of SMCMC methods when dealing with complex high-dimensional systems. 
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Fig. 1: Graphical representation of an hidden Markov model 


The performance of these techniques is finally assessed and compared with other existing techniques on a challenging 
application in which a time-varying spatial physical phenomenon has to be tracked from a sequence of noisy observations 
coming from a large sensor network. 

This paper is organized as follows. Section |II] mathematically formulates the inference problem by introducing the 
hidden Markov model and discussing the general SMC methodology and its limitations to high-dimensional problems. 
Then Section|ni]describes another class of sequential inference algorithms based on the use of Markov chain Monte-Carlo 
methods (SMCMC) as an alternative to SMC methods. Section |IV] presents the proposed Langevin and Hamiltonian 
based SMCMC algorithms. Numerical results are shown in Section |V] Conclusions are given in Section |VT] 

II. Model Formulation: High Dimensional Hidden Markov Models 

A hidden Markov model (HMM) corresponds to a R'^-valued discrete-time Markov process, {Xn}n>i that is not 
directly observable, instead we have access to another IR.‘^“-valued discrete-time stochastic process, {i^r!,}„>i, which is 
linked to the hidden Markov process of interest through a model structure. Owing to the Markovian property of the 
process, the joint distribution of the process is given by, 

n 

P{xi,n) = P{xi) fk{Xk\Xk-l): ( 1 ) 

k^l 

which is completely defined by an initial probability density function (pdf) ^{xi) and the transition density function at 
any time k, denoted by fk{xk\xk-i)- 

In a HMM, the observed process is such that the conditional joint density of Yi-n = yi-n given Xi-^ = xi-n 

has the following conditional independence (product) form, 

n 

P{yi:n\xi-.n) = 9k{yk\Xk)- ( 2 ) 

k^l 

The dependence structure of an HMM can be represented by a graphical model shown in Figure [T] 

In the class of HMM models, one of the most common inference problems is known as optimal filtering, which involves 
the estimation of the current state value based upon the sequence of observations observed so far. Such inference about 
Xn given observations Yi-^ = yi-n relies upon the posterior distribution, 

! \ \ \ P{xi-.n,yi-.n) p{xi,n)p{yi:u\xi,n) ,,, 

P{yi:n) P{yi-.n) 
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At time n = 1, we have: 


P{xi\yi) = - 1— -. 

p{yi) 


Then, Vn > 2, the following recursive decomposition applies 


P{xi-n\yi:n) 


971 {yn \Xn')fniXn |x^_i ) 
Piyn\yi:n-l) 


p{xi.,n-l\yi-n-l), 


where 


p{yu\yi-.n-i) 


9n ijjn \ Xn')fn {Xn |^n— 1 ^piXn— 1 \ yi-.n — l')(^Xn — l:n ■ 


(4) 

(5) 

( 6 ) 


This recursion can also be presented as filtering as follows: 

9n{yn\Xn)p{Xn\yi:n-l) 


Pixn\yi:n) = 


with 


p{yn\yi-.n-l) 

P{Xn\yi:n—l^ — fn{Xn\Xn—l^p{Xn—l\yi-n — l)clX7i —I. 


(7) 

( 8 ) 


Here, we refer to the sequence of distributions as the target distributions for which we wish to calculate 

quantities like J tf{xn)T^n{xi-n)dxi-n for some bounded and integrable test function ip{-) : Rp with p > 1. Often 

in practice this must be done numerically through stochastic simulation solutions, the focus of the remainder of the 
paper. 


A. Problem Statement: Why do Sequential Monte Carlo (Particle Filter) Approaches Fail in Fligh Dimensions? 

We begin with a brief review of Sequential Monte Carlo (SMC) methods of which there are several variants 
sometimes appearing under the names of particle filtering or interacting particle systems e.g. EOl-lEa. In a typical 
SMC framework one wants to approximate a (often naturally occurring) sequence of target probability density func¬ 
tions (pdf) of increasing dimension, i.e. the support of every function in this sequence is defined as 

supp(TTnixi:n)) = and therefore the dimension of its support forms an increasing sequence with n. We may also 
assume that 7r„ is only known up to a normalizing constant, 

(9) 

SMC methods firstly provide an approximation of 7ri(a:i) and an unbiased estimate of Z\, then at the second iteration 
(“time step” 2) once a new observation is received, an approximation of tt 2 {xi-, 2 ) is formed as well as an unbiased 
estimate of Z 2 and this repeats with each distribution in the sequence. 

Let us remark at this stage that SMC methods can be used for any sequence of target distributions and therefore 
application of SMC to optimal filtering, known as particle filtering, is just a special case of this general methodology 
by choosing 7 „(xi:„) = p{xi:n,yi-.n) and = p(t/i:„). 

Under standard SMC methods, we initialize the algorithm by sampling a set of N particles, ’ from the 

distribution tti and set the normalized weights to W( = 1/N, for all j = 1,..., TV. If it is not possible to sample directly 
from TTi, one should sample from an importance distribution qi and calculate its weights accordingly the importance 
sampling principle, i.e. W( oc 7ri(X()/qi(X(). Then the particles are sequentially propagated through each distribution 
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TTn in the sequence via two main processes: mutation and correction (incremental importance weighting). In the first step 
(mutation) we propagate particles from time n — 1 to time n using and in the second one (correction) we calculate 
the new importance weights of the particles. 

This method, named Sequential Importance Sampling (SIS), can be seen as a sequence of importance sampling steps, 
where the target distribution at each step n is 7r„(a:i:„) and the importance distribution is given by 

n 

qn{xi-.n) = qi{xi) qk{xk\xi:k-l), (10) 


where qk{xk\xi:k-i) is the proposal distribution used to propagate particles from time k 
the unnormalized importance weights are computed recursively by: 


W{xi,n) 


ln{xi,n) 

Qn (xi:n ) 

'yn—l(xi:n—l) 'yn(xi:n) 

Qn—1 — yn—1 (xi:n — l)qn(Xn — 1 ) 


1 to k. As a consequence. 


( 11 ) 


where w{xi:n) is known as the incremental importance weight. When SMC is applied for the optimal filtering problem 
with 7 n(a:i:„) = p{xi.m yi:n), it is straightforward to show by using the recursion of the smoothing distribution in Eq. 
© that the incremental importance weight is given by: 


'Kxi:n) = 


Inixi-.r. 


gniyn\Xn)fniXn\Xn-l) 


( 12 ) 


'Jn — l {xi-^n—l^Qn {Xn | tClm —1) qn (Xn |2:i:n —1) 

At any time n, we obtain an approximation of the target distribution via the empirical measure obtained by the 
collection of weighted samples, i.e. 


N 


^n(xi:n) = Y^W^<^^j^(dxi:n), (13) 

j=l 

where is the normalized importance weights such that — 1- 

However, direct importance sampling on a very high dimensional space using SIS is rarely efficient, since the 
importance weights in Eq. exhibit very high variance. As a consequence, SIS will provide estimates whose variance 
increases exponentially with time n. Indeed, after only a few iterations, all but a few particles will have negligible 
weights thus leading to the phenomena known as weight degeneracy. A well known criterion to quantify, in an online 
manner, this degeneracy is the effective sample size defined as follows: 


ESSsMC.n (14) 

with 1 < ESSsMC.n < N. In order to overcome this degeneracy problem, an unbiased resampling step is thus added 
in the basic algorithm when the effective sample size drops below some threshold, which as a rough guide is typically 
in the range of 30 to 60 % of the total number of particles. The purpose of resampling is to reduce this degeneracy 
by eliminating, for the next time step, samples which have low importance weights and duplicating samples with large 
importance weights EH, m- It is quite obvious that when one is interested in the filtering distribution p{xn\yi-.n), 
performing a resampling step at the previous time step will lead to a better level of sample diversity, as those particles 
which were already extremely improbable at time n — 1 are likely to have been eliminated and those which remain 
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have a better chance of representing the situation at time n accurately. Unfortunately, when the smoothing distribution 
is really the quantity of interest, it is more problematic since the resampling mechanism eliminates some trajectories 
with every iteration, thus leading to a problem known as path or sample degeneracy. Indeed, resampling will reduce at 
every iteration the number of distinct samples representing the first time instant of the hidden Markov process. Since in 
filtering applications, one is generally only interested in the final filtering posterior distribution, this resampling step is 
widely used in practice at the expense of further diminishing the quality of the path-samples. 

This SMC algorithm which incorporates a resampling step is often referred to as Sequential Importance Resampling 
(SIR) or Sequential Importance Sampling and Resampling (SIS-R). This approach applied for filtering is summarized 
in Algorithm[T] By assuming that the cost of both sampling from the proposal and computing the weight is 0{Cd) (i.e. 
a function of the dimension of the hidden state), the cost of the general SMC algorithm is 0{nNCd). 


Algorithm 1 SMC algorithm for optimal filtering (SIR) 


1: if time n — 1 then 

2: Sample X{ ~ qi(xi) , Vj = 1, • • ■ ,N 

3: Calculate the weights Wl oc ^ yj _ ... jy 

4: else if time n > 2 then 


91 


5: Sample and set Xl^ := ,Vj = l,-- 

gMXL)UXi\Xi_^) 


q„{xL\xL 


Vi = 1, • 


,N 

,N 


6: Calculate the weights oc 

7: end if 

8: if ESSsMc.n < r then 

9: Resample {W^, X(.^j to obtain N equally weighted particles (W^ = l/N,X(..^j 

10: end if 

11: Output: Approximation of the smoothing distribution via the following empirical measure: 

N 

TT{Xl:n) « ^ {dXl,n) 

7 = 1 


Having introduced the basic SMC approach to inference, we note further the limitations of this approach to high¬ 
dimensional applications. These limitations become abundantly clear when an SMC method is directly applied to 
high-dimensional HMM inference problems. This poor performance typically manifests in extremely large variance of 
estimators and relates to the fact that the importance sampling paradigm is typically very inefficient in high-dimensional 
models. The main reason why the SIR algorithm performs poorly when the model dimension is high is essentially the 
same reason why the SIS algorithm behaves badly when the time-horizon is large. As discussed previously, the SIS 
algorithm is designed to approximate the smoothing distribution p{xi.n\yi-.n), therefore weight degeneracy occurs as 
n increases, even for state vector Xn € with low dimension d = 1,2,3, •••, since the dimension of this target 
distribution increases with time. It is therefore intuitive to translate this concept from the path-space (xi:„) to instead 
think of what occurs in terms of degeneracy at a single time as the state-space dimension increases (i.e. as the dimension 
d increases from d = 10,100,1000,...) and analogous degeneracy effects typically due to the high variability of the 
incremental weights defined in Eq. (fT^ . This can be exacerbated when non-linear and non-trivial dependence structures 
are present between the state vector sub-dimensions. In Q, ll2^ . a careful analysis shows that the collapse phenomenon 
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occurs unless the sample size N is taken to be exponential in the dimension, which provides a rigorous statement of 
the curse of dimensionality. 

In addition, we observe the widely known feature of SMC methods, principally that their performance strongly depends 
on the choice of the importance distribution. The “optimal” proposal distribution in the sense of minimizing the variance 
of the incremental importance weights in Eq. (fTSI) is defined as; 

qn{Xn\Xn-l) =piXn\yn,Xn-l) (15) 

which leads to the following incremental weight w{xi-,n) = p{yn\xn-i) whose variance conditional upon xi-,n-i is 
zero since it is independent of Xn- Unfortunately, in many scenarios, it is impossible to sample from this “optimal” 
distribution. Many techniques have been proposed to design “efficient” importance distributions qn{Xn\Xn-i) which 
approximate p{xn\ymXn-i)- In particular, approximations based on the Extended Kalman Eilter or the Unscented 
Kalman Eilter to obtain importance distributions are very popular in the literature ll25ll . While the practical performance 
of the SIR algorithm can be largely improved by working with importance distributions that are tailored to the specific 
model being investigated, the benefit is limited to reducing the constants sitting in front of the error bounds, and this 
technique does not provide a fundamental solution to the curse of dimensionality ca, ca. 

A possible solution is to use Markov Chain Monte Carlo (MCMC) algorithms within SMC methods, which is 
a well known strategy to improve the filter performance. As discussed previously, resampling stages progressively 
impoverish the set of particles, by decreasing the number of distinct values represented in that set. Therefore, to try 
to combat this progressive impoverishment it has historically been addressed using the Resample-Move algorithm HI. 
The resampling-Move algorithm consists of applying one or more times after the resampling stage an MCMC transition 
kernel, ICn{xi.n,x[.^), such as a Gibbs sampler or Metropolis-Hastings scheme Il28ll . having 7r„(a;i:„) as its stationary 
distribution. This means that the following property holds: 

J T^n{xi.,n)JCn{xi.,n,x[.^)dxi.,n =7r„(Xi.„). (16) 

As a consequence, if the particles Xf.^ are truly drawn from 7r„(a;i:„), then the Markov kernel applied to any of the 
particles will simply generate new state sequences which are also drawn from the desired distribution. Moreover, even if 
the particles are not accurately drawn from 7r„(a;i:„), the use of such Markov transition kernel will move the particles so 
that their distribution is closer to the target one (in total variation norm). The use of such MCMC moves can therefore 
be very effective in reducing the path degeneracy as well as in improving the accuracy of the empirical measure of the 
posterior distribution. In practice for filtering problems, in order to keep a truly online algorithm with a computational 
cost linear in time, the Markov transition kernels will not operate on the entire state history, but rather on some fixed time 
lag L > 1 by only updating the variables Xn-L+i-.n- The computational complexity of this algorithm is 0{nNKM^) 
with the computational cost of a single iteration of a MCMC kernel on the state Xn-L+i-.n and K the number of 
MCMC iterations applied to each particle. Nevertheless, in high-dimensional problems, only one particle will typically 
have a non-zero weight leading after the resampling to the duplication of N identical particles. As a consequence, 
this strategy will consist in running N MCMC chains in parallel with the same starting point and thus can be quite 
computationally demanding as more iterations of these MCMC moves will be required in order to have an accurate 
approximation of the posterior distribution. 
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A recent review of other alternative solutions has been written in Q. However, none of these approaches solve all 
of the challenges discussed above. We therefore need a new paradigm to tackle the increasing number of applications 
requiring reliable and practically useable high-dimensional hltering methods. 

III. Sequential Markov Chain Monte Carlo: A recursive high-dimensional solution 

One of the most promising new approaches to the modification of SMC methods to tackle high-dimensional sequential 
hltering problems lies in the new class of methods known as Sequential Markov chain Monte Carlo methods, see recent 
discussions in Q. This class of methods aims to combine the recursive nature of SMC methods (which make them 
efficient for online inference problems) with the effectiveness of Markov chain Monte Carlo (MCMC) methods for 
dealing with high-dimensional sampling problems. 

Unlike importance sampling used in standard SMC methods, the traditional class of MCMC sampling methods is 
highly efficient for sampling high-dimensional spaces, if designed properly, but it is unable to do this in a recursive 
fashion that is required for online sampling from sequences of distributions such as in the high-dimensional HMM setting 
considered in this paper. The success of MCMC methods lies in their ability to perform local moves of the exploratory 
sampler (Markov chain), possibly within sub-dimensions of the state vector, as opposed to proposing independently the 
entire state-vector in a single mutation update, as is typically required by SMC methods. Then the bias correction is 
made not via an importance sampling weight correction but instead via a rejection sampling mechanism. Their traditional 
formulation, however, allows sampling from probability distributions in a non-sequential fashion. 

However, recently advanced sequential MCMC schemes were proposed in ini-iiTi for solving online hltering 
inference problems. These approaches are distinct from the Resample-Move algorithm HI where the MCMC algorithm is 
used to move samples following importance sampling resampling since these sequential MCMC use neither resampling 
nor importance sampling. 

A. General Principle 

In this section, we will describe a unifying framework that include all of the sequential MCMC (SMCMC) methods 
that have been proposed so far. The underlying idea of all these SMCMC approaches is to perform a Metropolis-Hastings 
(MH) accept-rejection step as a correction for having used a proposal distribution to sample the current state in order 
to approximate the posterior target distribution as opposed to SMC methods that use a correction based on Importance 
sampling. 

At time step n, the target distribution of interest to be sampled from is 

p{^l:n\yi:n^ ^ Hn (Z/n l^n—1) P(^l:n —1 |yi:n—1) • (17) 

TTriGi-.n) TTn _ i (x i, „ _ i ) 

Unfortunately, it is impossible to sample from p(a;i.„_i |?/i.„_i) since this distribution is analytically intractable. The key 
idea of all existing SMCMC methods is therefore to replace p{xi.,n-i\y\-.n-i) by an empirical approximation obtained 
from previous iterations of the algorithm in the previous recursion. Under this approach, at time step n, the distribution 
of interest is therefore dehned as: 

^nixi-.n) (X gn{yn\Xn)fniXn\Xn-l)^n-lixi-.n-l), (18) 
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with the empirical approximation 


where corresponds to the N samples of the Markov chain obtained at the previous (n— l)-th time 

step for which the stationary distribution was Let us remark that this target distribution converges to the 

true posterior distribution (i.e. 7f„ —>• Wn) as 7f„_i —>• 7r„_i. By using this empirical approximation of the previous target 
distribution, an MCMC kernel can be employed in order to obtain a Markov chain, denoted by ...), 

with stationary distribution 7f„(xi:„) as defined in Eq. (fTsT l. 

As summarized in Algo. |2] the SMCMC proceeds as follows. At time step n = 1, an MCMC kernel /Ci of invariant 
distribution 7ri(a;i) cx pr(yr |a;r)/r(xr) is employed to generate a Markov chain denoted by i, ■ • ■, . At time 

step n, the N + Nb iterations of the SMCMC aims at producing a Markov chain, denoted by (^X^ 
by using an MCMC kernel of invariant distribution Wn(xi-,n) as defined in Eq. (fTSl) . Once the n-th Markov chain 






( 19 ) 


m—Nb + 1 


has been generated, the last N are extracted to obtain the empirical approximation of the filtering distribution: 


P(xnlm:n) ^ ^ 


JV+Nt 


N 


( 20 ) 


m—Nb-\-l 


Let us firstly remark that due to the sequential nature of the problem, the elements in the Markov chain at time n 
corresponding to the state at previous time steps that have to be generated (i.e. have to be chosen from the 

discrete set [X^_^ i-n-ifm-N +i' discrete set has been obtained from the previous time step of the algorithm 
and corresponds to the empirical approximation of the previous posterior distribution in Eq. (fT^ . In 

HMM models, it is important to note that if we are only interested in approximating the filtering distribution, only 
{X^_^ ra-r/m-jv +1 stored from previous time step. 

In ITtI , the authors suggest that one can continue, at time step n, to add samples to the previous L Markov chains 
(line 8 of Algorithmic, i.e. Xn-L,i-.n-L with L > 1 in order to improve successively the empirical approximation of 
previous posterior distributions, and especially which is required in the posterior distribution of interest 

at time step n. 

By assuming that the computational cost of a single iteration of the MCMC kernel used is 0{Bd) (where the index 
d is used to indicate that the cost of such a MCMC kernel is generally a function of the dimension of the model 
under study), the cost of this algorithm is 0(nNBd) since the length of the burn-in period is generally considered to 
be a percentage of the useful samples, i.e. Nb = fiN with 0 < /3 < 1. Let us finally remark that in lfT4l . the authors 
designed an SMCMC that directly targets the filtering distribution, i.e. the marginal distribution of the one defined in Eq. 
(fTSl l. However, as discussed in lfT3l . the computational cost of this strategy is 0{nN‘^Bd) which can therefore become 
excessive as the number of samples N increases, owing to the need to compute at each iteration of the SMCMC a sum 
of N terms which corresponds to the Monte-Carlo approximation of the predictive posterior distribution in Eq. ®. 


B. Discussion on the choice of the MCMC Kernel for high dimensional SMCMC 

The overall performance of the SMCMC algorithm applied to optimal filtering depends heavily upon the choice of the 
MCMC kernel. One of the attractive features of this SMCMC is to be able to employ all the different MCMC methods 
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Algorithm 2 Generic Sequential MCMC algorithm for optimal filtering 

1: if time n = 1 then 
2: for j = 1,N + Ni, do 

3: Sample ^ ~ ■) with an MCMC kernel of invariant distribution 7ri(a:i) oc gi{yi\xi)^{xi). 

4: end for 

5: else if time n >2 then 

6: for j = 1,..., N + N), do 

7: [OPTIONAL] Refine empirical approximation of previous posterior distributions as described in (m 

8: Sample ■) with JC^ an MCMC kernel of invariant distribution defined in Eq. GD. 

9: end for 

10: end if 

11: Output: Approximation of the smoothing distribution with the following empirical measure: 

^ JV+JVfc 

7r(xi:„) Ri — V 5 j (dxun) 

N . , ^n.,l:n 

J=N^ + 1 


that have been proposed in the scientific literature. All the existing SMCMC algorithms that have been proposed in 
the literature iflSl - lflTl utilize a Metropolis-Hastings (MH) kernel ll28l which is described in Algorithm [3 The first 
observation about this MH kernel is the flexibility offered to the user in choosing the proposal distribution q, but this 
choice is crucial as it determines the performance of the algorithm. In this section, we discuss on how such a kernel 
can be chosen. 


Algorithm 3 Generic Metropolis-Hasting Algorithm as (j-th iteration) 


1: Require: 

2: Generate {X* ~ g(xi-„\Xl~^) 

. ^ .. , . A 

3: Compute the MH acceptance probability p = mm 1,- - —^^- ^^—= - 

4: Generate z ~ W(0,1) and set it z < p, otherwise 


1) Optimal Independent MH Kernel: 

In most of the existing SMCMC algorithms, an independent MH kernel is used. In such a kernel, the proposal is 
independent of the current value of the Markov chain, i.e. 

q{xi,n\XlT^)=q{xv.n). ( 21 ) 

In this context, a natural optimal choice consists in using the following proposal distribution; 

q{xi.,n) = 7f„(xi:„), 

Nb+N 

gn{yn\Xn)fn{Xn\Xn-l) ^ 5x[:[[_^^^.^_^{dxi,n-l), 

m=Nb + l ^ ^ 

Nb+N 

(Xp{Xn\yn,Xn-l) ^ X[[[_^^^.^X^Xi,n-l) ■ 

m—Nh + 1 

from which a sample can be obtained by following these two steps: 
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1) Generate X* ~ -7 

Z^j=Art + l P(,2/rapn-l — -^n-l,n-l) 

2) Generate X* „-p(a;„|yn,X* „_i) 

Since the proposal corresponds to the target distribution, every sample will be accepted. It is interesting to remark 
that using this proposal within the SMCMC will lead to an algorithm that is exactly equivalent to the fully adapted 
Auxiliary Particle filter proposed in ll29l and analyzed in details in ll^ . Unfortunately, it is generally impossible in 
most scenarios both to sample from p{x„\yn,Xn-i) and to evaluate p(j/„|a;„_i) = gn{yn\xn)fn{xn\xn-i)dxn- 
2) Approximation of the optimal independent MH Kernel: 


A first possible strategy could therefore consist in approximating the optimal independent MH kernel by using the 
following two steps: 

1) Generate ~ Em(da;i:„_i) 

2) Generate „ ~ X* „_i) 

By using this proposal, the MH acceptance probability is given by: 

^ A 9n{yn\Xl^)fn{X:^n\Xln-l) 

I /5(-^n,l:n-l)9n(^n,n|yn,X* 

(23) 

^ /3(-^n,l:n-l)9n(X^ \ 

"" 5n(j/n|XCn )/„(XCn I^EEi) ) ’ 

The idea is of course to choose /3(X^_j^ i-n-i) Qn(xnlyn, X* „_i) to be as close as possible to 

p(ynlXn-l = 

and p{xn\yn, X* „_i), respectively. One solution, which has been also used in the SMC literature and more especially 
in the framework of the auxiliary particle filter 1291 , is to utilize for example. 


ld{Xn-ip-.n-l) 9n {yn\Xn 


E/„ [X„|X„_i 




(24) 


which corresponds to the likelihood evaluated at the prior predictive mean. Then, in order to design QniXf^^n^ \yn, 
one can use a local optimization techniques such as a Laplace approximation centered around the mode of p(x„|y„, X* 
or a local linearization of the state-space model - see a for details. Nevertheless, it can be difficult to approximate 
accurately this optimal proposal distribution in complex and high-dimensional problems. 

3) Independent MH Kernel based on prior as proposal: 


The simplest alternative choice is to design a proposal based on the combination of both the prior distribution and 
the empirical approximation of the previous posterior distribution, i.e. 


j Nb+N 

q{xi,n) = fn[Xn\Xn-l)^ ^ 5, 


(25) 


from which a sample can be obtained by following these two steps: 

1) Generate ~ EmJ'i'^-ri 

2) Generate X* „ -/„(a;„|X* „_i) 
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With this proposal, the MH acceptance probability is simply given by the ratio of likelihoods; 


p = min 


5n(j/n|^ri"n) / 


(26) 


However, since the proposal of the current state Xn is based only on the prior information, the acceptance rate of 
this MH kernel could be very low thus leading to a very poor estimate, especially for complex target distributions or 
high-dimensional systems. 

4) Composite MH Kernels: 


Rather than building a proposal from scratch or utilizing a parametric approximation since it is unlikely to work for 
high dimensions or complex target distribution, another solution would consist in gathering information about the target 
stepwise, that is, by exploring the neighborhood of the current value of the Markov chain. Indeed, the use of a local 
proposal, such as random-walk MH kernel, that are less sensitive to the class of target distribution (HMM model) than 
a “global” proposal, such as that used in independent MH kernel, could potentially lead to more efficient algorithms 
which are also simpler to implement. It is important to note that the possibility of using such local moves is an appealing 
feature of this SMCMC methods compared to traditional SMC methods. Nevertheless, the main challenging difficulty 
in high-dimensional problems is to design an efficient local or global proposal. 

As a consequence, in ini, the authors propose to use instead composite MCMC kernels based on joint and conditional 
draws which has shown to be more efficient in high-dimensional systems f?!, ifTSl . ifTSl . OTIl . Summarized in Algol?) 
such a composite kernel is based on the following two main steps: 

1) A joint draw in which a Metropolis-Hastings sampler is used to update all the path of states corresponding to xi-n 

2) A rehnement step in which previous history and current state Xn are updated successively. Moreover, if 

Xn is high-dimensional, an efficient way to update it consists in firstly partitioning it into P disjoint sub-blocks 
and update them successively either via a random scan or a deterministic scan using a series of block MH-within- 
Gibbs steps. 

The proposal distribution used in lines 2, 6 and 11 could either be local or global based on random walk MH or 
independent MH, respectively. Let us remark that the rehnement step consists in updating the state Xi:n in blocks. As 
a consequence, if one can draw the sample from the following appropriate conditional distributions 


fnjxu = Xl Jxn-1 = 


Si fniXn — Xi^„\Xn-l — 


and for the subset Hp of the current state. 


^n,ni^p) ~ t d} \ Hp)), 

= PiXnlVn, „({1, Hp)), 


(27) 


(28) 


thus the acceptance ratios p 2 and {pR,p}p-i will be equal to 1, leading to a rehnement stage equivalent to a series 
of “perfect” Gibbs samplers 1^ . If sampling from Eq. ( |27] | can easily be done at the expense of some additional 
computational cost to compute the N probability weights, sampling from Eq. (l28l) will not generally be possible in most 
of models under study as it requires to be able to sample from posterior conditional distributions. As a consequence, the 
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proposal distribution used in these composite MCMC kernels to sample X* „(np) could be based on either conditional 
prior distributions or random-walk HD. If one wants to avoid a computational cost of 0{N) for a single iteration of 
MCMC to sample the past history Xn,i-.n-i as with Eq. dZTl i. one simple solution is to select a previous path-sample 
from a mixture in which the weights do not depend on the current value of the Markov chain i.e.; 

~ —l) (29) 

m 

so the weights, computed before running the MCMC iterations at time step n. This 

choice of proposal leads to the following acceptance ratio; 

Pi = min 1,-^-:- , -- (30) 

\ fn{Xn = Xi^n\Xn-l = j 

In II 32 II . the authors proposed to incorporate several additional attractive features of population-based MCMC methods 
m, 03 such as genetic moves and simulated annealing in order to improve the mixing of the Markov chain in complex 
scenarios, especially when the target distribution is multimodal. Such strategies could still be viewed as a composite 
MH kernel on an extended state-space lEa. 


Algorithm 4 Composite MH Kernels •) for the SMCMC 


1: Joint Draw 


2: Propose 

3: Compute the MH acceptance probability pi = min I 1, 


4: Accept = X* with probability pi otherwise set X^ = X^ 

5: Refinement 

6: Propose X^ ~ <?l(a:i:n-l |X^ 

7r r rs r KKTr . , , , ^^,n) 91 |X‘ , X^) 

7: Compute the MH acceptance probability p 2 = mm 1 






n,l:n-l = *‘1^ probability P2 ■ 


8: Accept X^ 

9: Randomly divide Xn into P disjoint blocks such that ftp = {1,..., d} and Qp fj = %Xp ^ k 

10: for p = 1,... , P do 
11: Propose X*_„(np) ~ gfl,p(a;n(r2p)|X;^ 

12: Compute the MH acceptance probability 

7r„(X;^,„(Qp), X^.„({1, ...,d}\ Hp), xC:„_i) 


pRv = min 1, 

<?fl,p(x*,„(np)|x^i^„) 

, gfl,p(X^,„(np)|X*^„(np), X^,„({1,..., d} \ Hp), xC.„_i) 


13: Accept Xn^„{^p) = X* „(r2p) with probability pR^p 

14: end for 


In this section, we described the different choices of MCMC kernel that has been used currently in the literature 
for SMCMC type high-dimensional sampling approaches and their optimal design. Unfortunately, in high dimensional 
systems with highly-correlated variables, the block sampling described as a refinement step in Algorithm 0] can be very 
inefficient. Indeed, in the presence of strong correlation, the block update using a series of MH-within Gibbs steps can 
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only perform very small movements ll28ll . As a consequence, the sampler will have a poor mixing rate thus producing 
a highly correlated Markov chain with potentially a very slow convergence rate. In the next section, we propose a new 
class of novel efficient kernels that can be utilized in a sequential setting for optimal hltering based on recent advances 
in MCMC techniques. 

IV. MCMC Kernel based on Langevin diffusion and Hamiltonian dynamics 

The objective of this section is to propose more efficient MCMC kernels that may be used within the SMCMC 
framework in order to tackle challenging high-dimensional problems. More specihcally, we describe two different MCMC 
kernel families based on Langevin diffusion and Hamiltonian dynamics. Both of these families of kernel use gradient 
information in a different way to traverse a continuous space efficiently. However as discussed in the previous section, 
due to the sequential nature of the hltering problem and the target distribution dehned in Eq. (fTSl l, the state to be sampled 
is comprised of Xn and xi-,n-i which have respectively a continuous and a discrete support i-n-iim-N +i' 

a consequence, we propose to use, as in the rehnement stage described previously at time step n and the j-th iteration 
of the MCMC, a succession of the two MH-within Gibbs steps: 

1) Sample i-n-i given using one of the different approaches described in Section UlI-BI 

2) Sample Xi^ „ given and X;^ i „_i using either Langevin diffusion or Hamiltonian dynamics based MH 

kernel. 

In this strategy, the target distribution of the second step is thus given by the following conditional posterior: 

= 'Tn(^ra|^ra,l:n—l) 9n{yn\Xn)fn{Xn\Xn—l — l)’ (31) 

For clarity purposes, the time index n on the state variable x is removed in the notation used in the rest of this section. 
A. On Langevin diffusion based MCMC kernel 

First used to describe the dynamics of molecular systems in physics IMl . the Langevin diffusion is given by the 
solution of the following stochastic differential equation (SDE) 

dX^ = ]^X\ogn{X^)dt + dB\ (32) 

It represents a process with stationary and limiting distribution tt. In this SDE, B* is the standard Brownian motion 
and V denotes the gradient operator with respect to variable X. A direct use of this SDE by using a hrst-order Euler 
discretization as in IJTI gives a proposal mechanism that creates the following Markov chain 

= 2f* + —Vlog)f(X*)-f eZ* (33) 

with ~ AA(2;|0, I^) and e the integration step size. Let us remark that other integration scheme can be used as proposed 
in ll?^ . Unfortunately, convergence of the Markov chain created by this equation is no longer guaranteed for finite step 
size e due to the introduction of an integration error. To overcome this limitation, a Metropolized version has been 
introduced in llWl which ensures convergence to the invariant measure. The so-called Metropolis Adjusted Langevin 
Algorithm (MALA) uses Eq. (l3^ as proposal distribution q{X*\X'^) followed by a standard Metropolis acceptance step 
with probability Tain{l,n{X*)q{X^\X*)/n{X^)q{X*\X^)). 
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As is common with random-walk MH algorithm when there is strong correlation between elements of x, a constant 
pre-defined covariance that reflects more accurately that of the target tt can be utilized in the proposal such as 


q{x\X'') = N [x 

leading to the “pre-conditioned” MALA BOl . 


X* -f —SVlog5f(X0 


(34) 


More recently, a promising generalization of previous algorithms has been proposed by considering a Langevin 
diffusion on a Riemannian manifold Ei-iia. The key idea is to take into account the local structure of the target 
density when proposing a move as it may greatly speed up the convergence of the Markov chain. Rather than employing 
a constant matrix as in the pre-conditioned MALA, the strategy consists in adopting a position specific covariance. This 
generalization of the Langevin SDE given in Eq. (l32l l is therefore defined as follows 


d 


withA,(X‘)=^^[G-i(X‘)]^ 


j=i 


(35) 


= -E 


G-i(X‘)^g^G-i(X‘) 


with a drift term and a diffusion coefficient that both depend on the state. The choice of this metric G(X) will be 
discussed in Section HV-CI In ll42l . it has been shown that this diffusion admits tt as invariant stationary distribution. 
The resulting MALA on manifold algorithm therefore uses the following proposal distribution 


q{x\X^) = M [x 


X*-f —G"^(X*)VlogS^(X*) 


+ -A(X*),e2G-i(X*) 


(36) 


Einally, by remarking that the elements that composed the drift term, A(X*) defined in Eq. dTSl l. are often very small, 
the authors in EH propose a simplified manifold MALA algorithm in which the proposal is given by; 


g(x|X*) = M [x 


X* -f —G-i(XOVlogi'(X*) ,e2G-i(X*) 


(37) 


This proposal can also be viewed as a generalization of the one used in the pre-conditioned MALA, in the sense 
that the covariance is no longer constant but instead becomes state dependent. Compared to the manifold MALA, the 
computational cost is reduced as the partial derivatives of the chosen metric G{x) involved in the computation of the 
drift term are no longer required. Let us mention some interesting recent work, where ll44ll proposes to use convex 
analysis rather than differential calculus, as described previously, in order to derive a novel Langevin MCMC kernel for 
log-concave distributions with interesting convergence properties. 

The proposed SMCMC algorithm that will use proposal distribution described in either Eq (l34l) . Eq (l36l) or Eq dJTl l 
will be named respectively by SMALA, SmMALA and Simplified SmMALA. 


B. On Hamiltonian based MCMC kernel 

In addition to Riemannian Langevin diffusion proposals, we described here another promising MCMC kernel based 
on Hamiltonian dynamics that we consider adapting for its use within the SMCMC framework for optimal filtering. 
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Hamiltonian dynamics was originally introduced in molecular simulation and later was used within an MCMC framework 
in B31 leading to the so-called “Hybrid Monte Carlo”. More statistical applications of Hamiltonian Monte Carlo (HMC) 
were then developed in ll46l and BtII . 

HMC is a powerful methodology to sample from a continuous distribution, tt(x) in our case, by introducing an 
auxiliary variable, q called momentum variables. In HMC, the Hamiltonian function is defined by 

H{x,q) = U{x)+F{q), (38) 

which describes the sum of a potential energy function defined as; 

U{x) =—\ogTi{x), (39) 


and a kinetic energy term which is usually defined as: 

F{q) = ^q^M-^q. 


(40) 


with M a positive definite matrix, generally chosen as an identity matrix. With these definitions, the dynamics of both 


variables with respect to a fictitious time r are given by the Hamiltonian equations: 


dx{i) 

dr 

dqi^) 

dr 


dH 

dq{i) 

dH 

dx{i) 


- 

dx{i) 


i91og7r(a:) 

dx{i) 


(41) 


Hamiltonian dynamics posses some interesting properties (energy and volume preservation as well as time reversibility 


which are described in detailed in ill), that allow its use in constructing MCMC kernel. The Hamiltonian in Eq. ( l3Ft 


defines equivalently the following joint distribution; 

7r(a:, q) cx exp {—H{x,q)) = n{x) exp 



(42) 


which obviously admits as marginal the target distribution of interest n{x). 

As summarized in Algorithm |5] each iteration of the HMC is composed of two steps. Firstly, given the value of both 
the state and the momentum obtained at the previous iteration, the first step consists in a Gibbs move that randomly draws 
a new value for the momentum variables from the conditional target distribution. In the second step, a Metropolis update 
is performed, using Hamiltonian dynamics to propose a new candidate {X*,Q*). In general, Hamiltonian dynamics, 
defined in Eq. (l4Tli are numerically simulated using a discretization method named the Leapfrog method (Algorithm 
lU ll45l . The obtained candidate (X*, Q*) is thus accepted as the next state of the Markov chain using a standard MH 
acceptance rule in order to correct the fact that the leapfrog method induces a bias. In order to avoid possible periodic 
trajectories of the HMC thus leading to a non-ergodic algorithm, it is recommended to randomly choose either the step 
size e or the the number of leapfrog steps N^p El. 

It can be shown that this HMC algorithm using a single step integrator (Npp = 1) with the Leapfrog method is 
exactly equivalent to the pre-conditioned MALA algorithm described in Section HV-AI with Eq. (l3^ . Although MALA 
can be viewed as a special case of HMC, the properties of both algorithms are quite different. As we can see from 
the construction of both kernels, the MALA is a random-walk MH adjusted by taking into account the gradient-based 
information whereas the HMC proposal involves a deterministic element based on Hamiltonian equation. As illustrated 
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Algorithm 5 Hamiltonian based MCMC Kernel for sampling ^ in the SMCMC 

1: Sample = Af {q\0,M) 

2: Propose (X*,Q*) using the Leapfrog method described in Algorithmwith {X^~^ as initial values. 
3: Compute the MH acceptance probability Phmc = {I 5 ■, Q*) + , Q ^))} 

4: Accept X^ = X* with probability Phmc otherwise set X^ = X^~^ 


Algorithm 6 Leapfrop method 

1: Input: Stepsize e, number of Leapfrog steps Nlj? and initial values (JA®, 

2: for n = 0,..., Nlf — 1 do 
3: Compute 

4: Compute = X"^ + eXqF{Q^^+^/^) = X"^ + eA4'-lQ'*''+'/2 

5: Compute - iv^C/(X"'+') 

6: end for 

7: Output: JA* = X^^lf and Q* = Q^^lf 


in na, one of the main benefits of HMC is to be able to avoid such a random-walk behavior. With an appropriate tuning 
of its parameters (N^p and e), the HMC is able to reach a state that is almost independent of the current Markov state. 
As discussed in ll48l . some asymptotic analysis of these algorithms shows that, in the stationary regime, the random-walk 
MH algorithm needs 0{d) steps to explore the state space whereas MALA and HMC needs only and 0{d^^'^), 

respectively. 

As for the MALA, the authors in HTI proposed a generalization of this HMC algorithm by considering Hamiltonian 
dynamics on a manifold in order to be able to take into account the local structure of the target distribution. The 
Hamiltonian is now defined as; 

H{x,q) = U{x) +F{q,x), 


with [/(x) = —log 7r(a;) (43) 

and F{q,x) = i log ((2^)‘'|G(a:)|) + ]^q^G-\x)q. 

The distribution associated to this Hamiltonian p{x,q) oc exp (—iJ(x, g)) still admits as marginal the desired target 
distribution of the state of interest p{x). As we can see, the kinetic energy term now depends on the state x. As a 
consequence, unlike in the previous HMC case, the Hamiltonian is no longer separable and therefore the Hamiltonian 


dynamics of each variable will now depend on both variables, i.e. 

dx{i) OH 


dr 


and 


dq{i) dH 
dr dx(i) 

dlogn(x) 1 
dx(i) 2 


9,(i) = 

dU{x) 1 i91og(|G(a:)|) 1 pdG~^{x) 


(44) 


Tr<^G-7a;) 


dx(i) 2 dx(i) 
dG(x) 


dx(i) 


2 dx(i) 
,dG(x) 1 




(45) 


To numerically simulate these Hamiltonian dynamics on a manifold, a generalized version of the Leapfrog integrator has 
to be used. The HMC on manifold based MCMC kernel is summarized in Algorithm [T] As for the HMC, this algorithm 
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produces an ergodic, time reversible Markov chain satisfying detailed balance and whose stationary marginal density is 
tt{x) ED. An interesting and rigorous discussion on the theoretical foundations of HMC kernels is presented in E9l . 
Additionally, we mention that a GPU implementation of this HMC, discussed recently in ifSOll . could greatly reduce the 
computational cost of this algorithm. 


Algorithm 7 Manifold Hamiltonian based MCMC Kernel for sampling X-j^ „ in the SMCMC 

1: Sample = M {q\0,G{X^-^)) 

2: Propose (X* ,Q*) using the Generalized Leapfrog method described in Algorithm [S] with (X^~^ , Q^) as initial values. 
3: Compute the MH acceptance probability PniHMC = min|l,exp (^—H{X*,Q*) + H{X^~^ 

4: Accept Xn^n = with probability PmHMC otherwise set X^^n = X^~^ 


Algorithm 8 Generalized Leapfrop method 


1: 

Input: Stepsize e, number of Leapfrog steps Nljp, number of fixed points Npp, and initial values (X®, Q^) 

2: 

for n = 0,..., Nx^p — 1 do 


3: 

% Update the momentum variables with fixed point iterations 


4: 

Set Q° = 


5: 

for fc = 1,. .., Npp do 


6: 

Compute with partial derivatives given in 

Eq. ED 

7: 

end for 


8: 

Set = Q^FP 


9: 

% Update the state variables with fixed point iterations 


10: 

Set X° = X^‘ 


11: 

for k = 1,..., Np p do 


12: 

Compute X*^ = X"^ + - ^\/qH{X"‘ 

)j with partial derivatives given in Eq. i4-4\ 

13: 

end for 


14: 

Set = X^FP 


15: 

% Update the momentum variables exactly 


16: 

Compute 0'*"+' = 


17: 

end for 


18: 

Output: X* = X^^FF and Q* = Q^^ff 



The proposed SMCMC algorithms, that we will utilize in the examples, use either an HMC Kernel (Algo. |5]l or 
Manifold HMC kernel (Algo. |2ll and each choice will be named respectively by SHMC, SmHMC. 

C. Choice of the tensor metric G{-) 

As suggested in ED and B3l . a natural choice for this metric is to take into account the local structure of the target 
distribution by using information from its hessian, i.e. 

G{xn) = -A“" log7f(a;„), (46) 

where := Va;„ is the second derivative operator. If the target distribution is non-Gaussian, the negative Hessian 
will be state dependent and its use within either mMALA or mHMC kernel will allow the algorithm to take into account 
the local curvature of the target distribution. However, one major issue with this choice results from the fact that unless 
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the target distribution is log-concave, this negative Hessian will not be globally positive-definite. To overcome this 
limitation, authors in El propose to use a technique, named SoftAbs, based on a smooth absolute transformation of 
the eigenvalues that maps this negative Hessian metric into a positive-definite matrix in a way that the derivative of this 
transformed metric (required in both the SmMALA and SmHMC) is still computable. 

An alternative strategy, used in IItTI . consists in choosing G{xn) as a Fisher metric. In our context of filtering, this 
metric will be defined as: 


G{Xn) ]Ey^|x„ [^a:„ l^ri)] 


(47) 


- log/„(a:„|x„-i = 

which corresponds to the expectation over the data of the metric defined previously in Eq. (|46]) . If such expectation is 
analytically tractable, this metric is guaranteed to be positive-definite as long as the prior is log concave and therefore 
will constitute a suitable metric for both SmHMC and SmMALA. However, if the prior distribution is not log-concave 
(as in the problem we propose to tackle in Section IV-BI ), one can use the SoftAbs technique of Ell described before 
to render this metric positive-definite. Nevertheless, in this paper, we propose a simpler alternative which consists in 
approximating (just for the computation of this metric) the prior distribution with a multivariate normal distribution: 


G{xn) — [^a;” log Pn(j/n |a:n)] log A/'(a:n; 

= [A“" log5„(j/„|a;„)] -f , 


(48) 


where E„ = Var/,^ yXn\Xl^ n-i j covariance matrix of from the true prior distribution. 

By using such a strategy, the derivative of G{xn) required in both SmMALA and SmHMC will depend only on the 
derivative of the first term, i.e. 


dG{Xn 


d 


[KZ0ggn{yn\Xn)\ 


(49) 


dxn{i) dxn{i) 

As a consequence, this proposed metric does not require any additional parameters to be tuned and is clearly less 
computationally demanding than the SoftAbs. 


V. Numerical Simulations: Large Spatial Sensor Networks 

In this section, we study the empirical performance of the proposed sequential Langevin and Hamiltonian based MCMC 
algorithms in a challenging high-dimensional problem. In particular, we address the estimation of a complex physical 
phenomena from a collection of noisy measurements obtained by a large network of spatially distributed sensors. Such 
sensor networks have attracted considerable attention due to the large number of applications, such as environmental 
monitoring ll52l . Il53l . weather forecasts El, surveillance ll55ll . health care ll5^ . ... These sensors typically monitor a 
spatial time-varying physical phenomenon containing some desired attributes (e.g pressure, temperature, concentrations 
of substance, sound intensity, radiation levels, pollution concentrations, seismic activity etc.) and regularly communicate 
their observations to a Fusion Center. This fusion center collects these observations and fuses them in order to reconstruct 
the signal of interest at the current time, based on which effective actions can be made. As a consequence, it is of great 
interest to study how accurately these Monte-Carlo algorithms are able to track the time evolution of such a high¬ 
dimensional physical field. 
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More specifically, in this section, we consider a time-varying spatially dependent continuous process defined over 
a 2-dimensional space which is observed sequentially in time by d sensors deployed over a 2-D monitoring region. 
Each sensor therefore collects, independently of each other, at time n some noisy information about the phenomenon 
of interest at its specihc location, i.e. \/k = 1,... ,d: 

Yn{k)\Xn{k) = Xn{k) ~ gn{yn{k)\Xn{k)) (50) 

The physical location of all sensors, denoted by Sk € with k = d}, is assumed to be known by the fusion 

center. Therefore, the objective is to estimate at time n, the value of the physical phenomenon Xn G R'^ at these 
d different sensor locations given their measurements from time 1 to n (i.e. j/i,...,y„). In this paper, in order to 
model the spatial and temporal dependence of the physical process of interest, we consider the following multivariate 
Generalized Hyperbolic (GH) distribution ll57l as prior distribution: 

f„{Xn\Xn-l) (xKx-d/2 {\/(X + Q(a^«))(V’ + 

(51) 

\/(x + Q{xn)){iJ + 7^S-i7)" 

where Q{xn) = {xn — yn)'^Y~^(xn — /in) and fin = OiXn-1 G R"^ is the location parameter with a G R. K\ denotes 
the modified Bessel function of the second kind of order A. The prior distribution for the hrst time step is defined as 
fi{xi) = /i(a;i|a;o = 0). The parameters A, x and iji are scalar values that determine the shape of the distribution. 
E G is the dispersion matrix and the vector 7 G R^^ is the skewness parameter. This multivariate generalized 

hyperbolic family is extremely flexible and has received, until now, a lot of attention more in the financial-modeling 
literature 1581 . Indeed, this distribution allows to take into account heavy-tailed and asymmetric data, which could be very 
benehcial in modeling some physical process with extremal behavior. Moreover as illustrated in Fig. |2l this distribution 
contains many special cases known by alternative names: normal, normal inverse Gaussian, skewed-t, etc. ISl 

In our simulation, the dispersion matrix of this multivariate GH distribution is a positive definite matrix and is dehned 
such that the degree of the spatial dependence in the process increases with the decrease of the separation between two 
locations, i.e. 

P]y = aoexp 

with II ■ II 2 the L2-norm and Sij the Kronecker symbol. 

In the following examples, the proposed sequential Langevin and Hamiltonian based MCMC algorithms will be 
compared to three different variants of SMC-based algorithms: standard SIR algorithm, the block SIR | 6 l (with a block 
size of 4) and a Resample-Move algorithm, denoted by SIR-RMiX, for which K MCMC moves with the mHMC kernel 
described in Section IIV-BI is applied on each particle (for x„ - i.e. L = 1) after the resampling stage. An SMCMC 
approach with a composite MH kernel described in Algorithm |4] with (conditional) prior distributions as proposals, 
denoted by SMCMC-Prior, is also studied. The rehnement step of the state at the current time, Xn, is also performed 
with a random partitioning of size 4. The refinement step of xi-.n-i for all SMCMC-based approaches utilized the 
empirical approximation of the previous posterior distribution as proposal distribution. As already observed in a static 
problem in which the state of interest is high-dimensional and highly correlated, the SMALA was unable to perform 
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(b) (c) 

Fig. 2: Illustration of few distributions from the Generalized Hyperbolic family with [E ]]^2 = Pin = [^]22 = 
/ 2 ( 1 ) = /r(2) = 0, A = —V12, x = V ip ^ 0. (a): bivariate Normal distribution (7 —!■ 0 and 1 / — >■ 00) - (b): Bivariate 
multivariate t distribution ( 7^-0 and 1 / = 3) - (c); Bivariate GH skewed-f distribution ( 7 ( 1 ) = 7 ( 2 ) = 2 and v = 3) 
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well - see details in 1591 . N^p = 20 steps have been used in the classical Leapfrog integrator in Algo. |6](and Npp = 10 
and Npp = 2 for its generalized version in Algo. [8]). Finally, as suggested in BTI . the stepsize e was tuned such that the 
acceptance ratio was between 40% — 70% (and 70% — 90%) for the sequential Langevin (Hamiltonian) based MCMC. 
These values are based on some theoretical analysis on the optimal acceptance rate - see l60l . Some adaptive procedures 
such as ED-im can also be used. Let us mention that some MCMC convergence diagnostics, such as 1641 for example, 
can be used to adaptively find the length of the burn-in period. In the following experiments, we set Nb = O.IN. 

All the algorithms were implemented in the interpreted language MatlatQ and simulations were run on a single Intel 
Core i7 2.6GHz with 16GB of memory. 


A. Example 1: Dynamic Gaussian Process with Gaussian likelihood 


In this first example, we consider the simplest special case of the GH family, the multivariate normal distribution. 
Moreover, we consider that each sensor measures the physical process of interest with some Gaussian random noise, 
thus leading to the following HMM; 


(53) 

9n(^yn\^n) — ijjni Xm ^ 

with Tiy = a-yldxd- For the experiments, we fix the model parameters as a = 0.9, cr^ = 2 and (ao = 3, ai = 0.01, 
(3 = 20) for constructing the dispersion matrix in Eq. (l52l l. Moreover, the sensors are uniformly deployed on the grid 

4 X 4 . 

Such a model is interesting for the understanding and the study of approximation methods since the posterior 
distribution can be derived analytically via the use of the Kalman filter IH. Moreover for this model, the SMCMC 
algorithm with the optimal independent MH kernel (equivalent to the fully adapted Auxiliary particle filter) described 
in Section IIII-BII can be used as a benchmark since all the different distributions required for its implementation can 
be derived analytically. For the proposed SmMALA and SmHMC, we use a metric derived from Eq. (l47l) . i.e. 


= + (54) 

with 'Ey = CTyldxd- Since this metric does not depend on the state, the Simplified SmMALA is equivalent to the 
SmMALA (since the drift of the SmMALA is zero). Moreover, for the same reason, the Hamiltonian dynamics on 
manifold expressed in Eq. (l43T l is separable as F{q,x) = F{q) does not depend on the state. As a consequence, the 
classical Leapfrog integrator can be used for the SmHMC. 

Figure |3 shows the bias and the variance of the posterior mean estimator obtained by the different algorithms across 
the d = 64 dimensions of the state at several time steps. From these results, we can clearly see that a significant 
degradation of the performance occurs when the standard SIR algorithm is employed compared to the SMCMC-Optimal 
as only prior information is used to sample the particles. Compared to the SIR, the block SIR (with blocks of dimension 
4) clearly allows to decrease the variance of the estimator but at the expense of an increase of the bias. Indeed, this 
effect is due to the approximation of the posterior as a product of marginals on each block and is well known for this 


'Codes are available at http://pagesperso.telecom-lille.fr/septier/software.html 
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Fig. 3: Evolution of the mean (± standard deviation) of the error between the posterior mean obtained by the different 
algorithms and the true one (obtained by using Kalman equations) across the d = 64 dimensions that composed the 
state and at different time step (results are obtained with 100 runs on the same data set - iV = 200). 


technique 171, ifTTl . More importantly, unlike all the other methods (SIR and SMCMC-based ones), this bias will 
never tend asymptotically (with the number of samples N) to zero as long as the block size is less than the dimension 
of the state. Finally, we can see that the proposed SmHMC algorithm clearly outperforms both the SIR and the Block 
SIR by providing an estimator of the posterior mean with a small bias and variance, close to the SMCMC-Optimal. It 
should be noted that the SmMALA gives results similar to the one of the SmHMC. 

In Fig. m the log-relative mean-squared error (MSE) of the posterior mean between the Monte-Carlo algorithms and 
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Fig. 4: Log relative (to the optimal one given by Kalman equations) Mean squared error (average over time) for the 
different algorithms as the dimension of the state, d, increases. (N = 200). 


the Kalman filter is depicted as the dimension of the state to infer increases. The proposed SmHMC and SmMALA give 
similar performances and outperforms significantly the other sequential techniques. Moreover, their performances remain 
very close to the one obtained with the SMCMC-Optimal even when the dimension of the state becomes quite large. 
The block SIR outperforms the standard SIR when d > 20. As discussed previously, the reduction of the variance with 
the block SIR compared to the SIR becomes more beneficial as d increases even if a bias is introduced. The use of one 
MCMC move on each particle within the SIR (SIR-RMl) allows to improve the performance of the SIR. Nevertheless, 
we can clearly see that the use of mHMC kernel within the SMCMC framework provides the best performances results 
compared to its use within the SMC framework. Table |T] shows the log relative MSE and the computation time per time 
step for these two different use of the mHMC kernel. As expected, the performance of the SIR-RM increases with the 
number of MCMC moves applied on each particle within the SMC but at the expense of an increased computational 
cost. However, even with 3 moves, the SmHMC outperforms the SIR-RM3 with a computational cost three times less. 
As discussed previously, the problem with the SIR-RM algorithm is that as d increases only one unique particle (with 
non-zero weights) is duplicated N times by the resampling step. Therefore, more MCMC moves are required in order 
to obtain a satisfactory empirical approximation of the posterior distribution. Figure |3 illustrates the time evolution of 
the MSE which remains stable for large n. The proposed SmHMC still outperforms its competitors at a larger time 
horizon. 

Fig. |6] shows the number of particles N required in the SIR algorithm and its associated computation time in order 
to obtain the same performance of the SmHMC in terms of MSE. As discussed previously with Fig. H the MSE of 
the SmHMC being almost constant with d, the number of particles required in the SIR explodes exponentially with 


October 30, 2015 


DRAFT 










25 


Method 


Dimension d = 144 Dimension d = 400 

Time [sec.] Rel. MSE [log] Time [sec.] Rel. MSE [log] 


SmHMC 

1.54 

0.20 

15.65 

0.21 

SIR-RMl 

1.35 

0.71 

14.10 

1.34 

SIR-RM2 

2.60 

0.28 

30.01 

0.62 

SIR-RM3 

3.98 

0.25 

42.09 

0.26 


TABLE I: Comparison of the log relative (to the optimal one given by Kalman equations) mean squared error averaged 
over the 100 Monte-Carlo algorithms and 10 time steps and the associated computation time per time step for the 
SmHMC and the SIR-RM with different number of MCMC moves after the resampling stage (N = 200). 
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(a) dimension d = 144 


(b) dimension d = 400 


Eig. 5; Time evolution of the mean squared error in log (results are averaged over 25 runs - N = 200). 


the dimension of the state to infer, see discussions in a, m. As a consequence, the computational time grows 
exponentially for the SIR and we can see that in order to reach similar MSE performances the computational time of 
the SIRis significantly higher than the one of the proposed SmMALA and SmHMC, especially as d becomes large. The 
SmHMC is slightly more computationally demanding than the SmMALA, due to the use of the Leapfrop integrator with 
Nlp steps. Let us finally remark that since the Block SIR introduces some bias by construction, it was not possible to 
reach with this algorithm the MSE performances obtained with the proposed SmHMC. 

In Table HI] we compare the relative efficiency of these different methods by calculating the effective sample size 
(ESS) using the posterior samples for each dimension of the state. 


ESS = 


N 

1 + 


(55) 
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Fig. 6: Study of the number of particles N required in the SIR algorithm (left) and its associated computation time 
(right) in order to obtain the same performance of the SmHMC in terms of MSE (as shown in Fig. |4]i 


Method 

Time 

(sec.) 

ESS 

(Min.,, Med., Mean, Max.) 

MeanESS 

Time 

SMCMC-Prior 

25.78 

(3, 8, 9, 31) 

0.35 

SmMALA 

2.13 

(15, 47, 48, 86) 

22.54 

SHMC 

2.83 

(26, 80, 80, 141) 

28.27 

SmHMC 

3.71 

(42, 128, 130, 243) 

35.04 


TABLE II: Comparison of the different MCMC kernels in terms of Effective sample size (ESS) and computation time 
per time step (d = 144 with N = 500). 


where N is the number of posterior samples (after the Burn-in period) and ^ monotone sample 

autocorrelations as estimated by the initial monotone sequence estimator of 1^ . The ESS estimates the reduction in the 
true number of samples, compared to iid samples, due to the autocorrelation in the Markov chain. The reported values 
in this table correspond to the minimum, median, mean and maximum ESS values across the d dimensions of the state 
averaged over the 10 time steps and 100 Monte-Carlo runs. The mean ESS is then normalized relatively to the CPU 
time required to produce the Markov chain of length Ni, + N at each time step. Results in Table HI] clearly show that 
the SMCMC-Prior performs very poorly. Indeed, the sampler uses a series of MH-within Gibbs to update the curi'ent 
state by blocks and thus producing a highly correlated Markov chain. Moreover, its computation time is very high due 
to number of loops required to perform the 144/4 block updates at each iteration. The use of Hamiltonian dynamics 
in the SMCMC clearly allows to achieve the largest ESS values. The use of Riemannian manifold within the HMC 
provides some improvements in terms of ESS compared to a classical HMC at the expense of additional computation 
time. Nevertheless, the SmHMC gives the best performances with the ESS normalized by the computation time. 
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B. Example 2: Dynamic Skewed-t process with count observations 

In this second example, we consider a high-dimensional non-linear and non-Gaussian state-space model in which each 
sensor collects count data, so that the likelihood is defined as 

d 

9n{yn\xn) = “Po (^/n(^); exp(m2Xn(A:))) (56) 

k^l 

Each measurement is Poisson distributed with mean mi exp(m 2 a;„(fc)) (mi = 1 and m 2 = 1/3 in the experiments). 
The prior distribution describing the spatial and temporal evolution of the physical phenomenon is the multivariate GH 
skewed-f distribution defined by Eq. (fSTT l with A = —v 12, x = ^nd ip ^ 0. For the experiments, we fix the model 
parameters as a = 0.9, ay = 2, 1 / = 7, {7(fc)}fc=i = 0.3 and (ao = 3, ai = 0.01, /3 = 20) for constructing the 
dispersion matrix in Eq. (|52] |. 

Since in this scenario the prior is non-log concave, we use the proposed metric based on a Gaussian approximation 
of the prior, defined in Eq. ( |49] |. which is given as a consequence by: 

G(x„) = A(:r„) + E-i (57) 

where A{xn) is a diagonal matrix with elements [A(a;„)]j, = mim^ exp(m 2 a:„(fc)). Moreover, from the property of the 
multivariate GH skewed-f distribution, its covariance is given by El as > 4: 

E = Var,. = ^E + ,,, 77^ 

Unlike in the previous example, since the metric depends on the state, the generalized Leapfrog integrator has to be 
used for the SmHMC and moreover the drift term in the SmMALA is not equal to zero (so the Simplified SmMALA 
is not equivalent to the SmMALA). 

Table |III] shows the MSE obtained on average at each sensor location. The use of the proposed Langevin and 
Hamiltonian based MCMC kernel clearly allows a significant improvement and more importantly their associated MSE 
are quite stable with the dimension of the state to infer. These results also shows the benefit of using such MCMC 
kernel within the SMCMC framework (SmHMC) compared to its use within the SMC (SIR-RM). 

We compare in Table |IV] the ESS of the different Sequential MCMC methods. Unlike in the previous example, 
the computational time of both the SmMALA and the SmHMC is larger since the derivative of the metric has to be 
computed at each iteration of the MCMC. On the one hand, the SmMALA obtains slightly better ESS than its simplified 
version since proposed steps across the manifold will have greater error by not fully taking into account changes in 
curvature (with the drift term). The ESS normalized by time however is much better for the Simplified SmMALA, as 
the computational complexity is far less. On the second hand, the SmHMC clearly gives the best ESS and illustrates 
that this technique is very efficient to sample from this challenging posterior distribution. Despite its higher computation 
time, the ESS normalized by time is also better for this SmHMC when d = 400. 

Finally, Fig. Q shows the estimated posterior mean and variance of the state at few time steps for the different 
sequential techniques. All the proposed SMCMC-based approaches are clearly able to reconstruct the signal of interest 
from the data. Unlike the Block SIR which fails completely to estimate the posterior variance (owing to the product 
approximation of the posterior that is the basis of this technique), the proposed techniques provide reasonable and 
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Method 


Dimension d 



144 

400 

1024 

SIR 

4.95 

8.87 

12.17 

SIR-RMl 

0.88 

1.13 

2.74 

SIR-RM2 

0.66 

0.82 

1.62 

SIR-RM3 

0.65 

0.68 

1.36 

Block SIR 

1.29 

1.48 

1.55 

SMCMC-Prior 

1.68 

3.35 

5.23 

Simplified SmMALA 

0.61 

0.79 

0.91 

SmMALA 

0.60 

0.76 

0.88 

SHMC 

0.63 

0.69 

0.77 

SmHMC 

0.55 

0.58 

0.65 


TABLE III: Comparison of the mean squared error obtained at each sensor location on average over the 100 Monte-Carlo 
algorithms and 10 time steps for several dimension configuration d (N = 200). 


satisfactory estimation of this posterior variance. Indeed, we expect that there is more uncertainty in the estimate where 
there is less data. Owing to its capacity to explore the space which has been demonstrated empirically with its ESS, 
the SmHMC seems to give a more robust estimation of both posterior mean and variance value across space and time. 

VI. Conclusion 

In this paper, after describing the optimal filtering problem in a general HMM, we provide a unifying framework of the 
sequential Markov Chain Monte Carlo algorithms which constitute a promising alternative to traditional sequential Monte- 
Carlo methods. In particular, the choice of MCMC kernels are discussed in order to provide guide for practitioners. More 
importantly, we propose new efficient kernels adapted to this SMCMC framework in order to increase the efficiency of 
such approaches when dealing with high-dimensional filtering problems. Through two challenging examples, the results 
empirically show a significant improvement when such proposed sequential Langevin or Hamiltonian based methods 
are utilized. We have empirically demonstrated that the use of such MCMC kernels within the SMCMC framework 
clearly provides better performance results compared to their use within the SMC framework as with the resample-move 
algorithm. Those techniques pave the way to a renewed consideration of Monte-Carlo based techniques for Bayesian 
filtering in complex and high-dimensional systems. 
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